Workbench Add-On

home *** CD-ROM | disk | FTP | other *** search

/ Workbench Add-On / Workbench Add-On - Volume 1.iso / DiskUtil / Crunch / XPK / DOCS / HUFF.DOC < prev next >

Wrap

Text File | 1992-08-10 | 10KB | 218 lines

HUFF A dynamic huffman cruncher/decruncher Version 0.62 Copyright 1992 by M.Zimmermann License/Disclaimer ------------------ This library may be freely distributed with the XPK compression package, as long as it is kept in its original, complete, and unmodified form. It may not be distributed by itself or in a commercial package of any kind without my permission. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. About Huffmans -------------- The idea of a huffman crunch is as follows: often used bytes (ie 8 bit codes) get a shorter code than 8 bits, fi 5 bits. So everytime one of these bytes occurs in the source file I save (in this example) 3 bits in the dest file. To find out the optimum codes for each byte there is a simple method: A tree is to be constructed so that the path from the root to each leaf reflects the optimum (ie huffman) code. Unfortunately most computers (the Amiga, too) are byte-oriented, which means a rather complex handling of codes that are not a multiple of 8 bits. This results in pretty slow compression & decompression. So this means that the xpkHUFF.library probably won't be used for normal circumstances, but, as Dominik stated, it may serve well as an example library. There are three different huffman crunch algorithms known: · static compression/decompression · dynamic compression/decompression · adaptive compression/decompression What are the differences? The static huffman uses a fix table to represent each byte of the source. This, of course, makes sense only, if the structure of the data to be crunched is known. In this case (for instance crunching an english text) a fix table of codes is embedded in the code. Crunching other data than what the table was generated for will probably result in worse compression or even expansion. This is what a dynamic huffman is avoiding: it first creates a statistics table reflecting the frequency every byte occurs with and generates a special tree/table for this case, so the dynamic huffman does a good compression for this special data. But there is something that can be improved, anyway: imagine, there is a data block which contains many 'a's in it's first part and many 'z's in the last part.... The dynamic huffman would represent the 'a's and 'z's with short codes, of course. But it probably would be even better if the crunch/decrunch tree would reflect the *current* data beeing processed instead of the whole block, thus in resulting shorter codes for 'a' and 'z', depending of the position in the data block. This is what an adaptive huffman deals with: it creates the tree while crunching, without doing any statistics or tree creation before crunching. It permanently updates it's internal huffman tree. Therefore it doesn't have to include the information about the decrunch tree in the crunched data. Final words about huffmans ... ------------------------------ A stand-alone huffman will never achieve crunch results as fine as those reached with most other crunchers, for these will not only regard the number of occurances for each byte (as huffman does), but sequels of byte, too. This means: If you create all permutations of a datablock, the huffman crunched will always have the same length. Others won't, as they are depending on the order of the crunched data, too. Description ----------- The library 'xpkHUFF.library' implements a dynamic huffman crunch algorithm, even though the adaptive might result in slightly better crunch results. However, this is more complex to implement and I'm using a maximum buffer size of 64K, so this is a little bit like an adaptive huffman for large files. If I should have lots of spare time I will probably implement an adaptive huffman crunch algorithm. This new library will be called xpkHUFF, too, and new xpkHUFF.libraries will always handle output generated by earlier versions. The xpkHUFF.library supports a totally unsafe (but a little bit better than simple eor :-) encryption. Please note that crunch/decrunch speeds decrease when encryption is used. The source ... -------------- The source is provided, so you might have a look at different decrunch codes. Please note: The only code I tested intensively is the byte/cache decrunch code, which was used to create the library included in this package. For it's the fasted code, you probably won't want to use another one. However, this might help you to understand what huffmans are about. If you have never written a huffman, have a look at it, a huffman code is a pretty good idea. If you *have* written a huffman code and it's *faster* than mine, please let me know, I will then update xpkHUFF.library. Implementation -------------- If you should see an errormessage saying output buffer too small while crunching *and* encrypting, this means you tried to crunch and encrypt a file that would crunched and encrypted be larger than the original file. This should occur only with very small files (for I have a minimum file size due to tables) or with files that have been crunched already and therefore would expand during crunch. A technical note: this could also happen, if the last chunk of a file to be crunched/encrypted would be dimensioned too small by xpkmaster.library. However, in this case you cannot encrypt the fcnt = in_idx - anchor) > 2) { if (cnt <= 18) /* short rle */ { ++cnt_srle; *out_idx++ = cnt - 3; *out_idx++ = c; } else /* long rle */ { ++cnt_lrle; cnt -= 19; *out_idx++ = 16 + (cnt & 0x0F); *out_idx++ = cnt >> 4; *out_idx++ = c; } ctrl_bits = (ctrl_bits << 1) | 1; continue; } /* look for pattern if 2 or more characters remain in the input buffer */ in_idx = anchor; if ((inbuff_end - in_idx) > 2) { /* locate the offste of possible pattern i sliding dictionary */ ++cnt_hashes; hash = ((((in_idx[0] & 15) << 8) | in_idx[1]) ^ ((in_idx[0] >> 4) | (in_idx[2] << 4))) & hash_len; pat_idx = hash_tbl[hash]; hash_tbl[hash] = in_idx; /* compare characters if we're within 4098 bytes */ if ((gap = in_idx - pat_idx) <= 4098) { while (in_idx < inbuff_end && pat_idx < anchor && *pat_idx == *in_idx && (in_idx - anchor) < 271) { in_idx++; pat_idx++; } /* store pattern if it's more than 2 characters */ if ((cnt = in_idx - anchor) > 2) { gap -= 3; if (cnt <= 15)/* short pattern */ { ++cnt_spat; *out_idx++ = (cnt << 4) + (gap & 0x0F); *out_idx++ = gap >> 4; } else { ++cnt_lpat; *out_idx++ = 32 + (gap & 0x0F); *out_idx++ = gap >> 4; *out_idx++ = cnt - 16; } ctrl_bits = (ctrl_bits << 1) | 1; continue; } } } /* can't compress this character so copy it to outbuff */ ++cnt_nocompr; *out_idx++ = c; in_idx = ++anchor; ctrl_bits <<= 1; } /* save last load of control bits */ ctrl_bits <<= (16 - ctrl_cnt); *ctrl_idx = ctrl_bits; /* and return size of compressed buf